2024-10-22
PAPER
methods: how do different text mining approaches spotlight different
historical trends/fluctuations/moments in labor-environmental (labor?
environmental?) issues since the 1960s? (i.e., compare the methods)
- period distinctiveness (tf-ipf)
- textual averages (triples)
- convergence/divergence (topic models + word co-occurrence
networks)
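a rough sketch of the edge counts behind a word co-occurrence network, assuming co-occurrence means "both words appear in the same speech" (a sliding window would be another reasonable choice). the function name `cooccurrence_edges` and the `min_count` threshold are hypothetical, not from the existing code:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(token_lists, min_count=1):
    """count, for each unordered word pair, how many speeches contain both words.

    token_lists: iterable of token lists, one list per speech.
    returns {(word_a, word_b): n_speeches}, with word_a < word_b alphabetically.
    """
    edges = Counter()
    for tokens in token_lists:
        # de-duplicate within a speech, and sort so each pair has one canonical order
        vocab = sorted(set(tokens))
        edges.update(combinations(vocab, 2))
    # drop rare pairs (assumed threshold; tune per dataset)
    return {pair: n for pair, n in edges.items() if n >= min_count}
```

the number of pairs grows quadratically with each speech's vocabulary, so in practice this probably needs stopword removal or a restricted vocabulary first.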
UPDATES
- re-subset speeches with new set of environmental + labor keywords
(two versions)
- re-ran speech counts, token counts
- re-ran tf-ipf code (this time labor speech data was small enough to
work)
- workshopping triples code, which is giving me some trouble. i’ll just
move on to topic models + co-occurrence networks for next week, for the
sake of previewing results beyond tf-ipf.
KEYWORDS
how to cast the widest possible net for “env’t” and “labor” without
capturing anything beyond env’t and labor: the fewest, vaguest possible
terms that still get us what we want. i.e., capturing the most expansive
umbrella of “env’t” and “labor” without:
- casting too wide a net and capturing non-environmental and non-labor
speeches
- overdetermining the issues/terms associated with each
i.e., my analysis should tell me when something like “globalization”
is articulated specifically as a labor issue or “urbanization” is
articulated specifically as an environmental issue, without preemptively
compiling ALL globalization or urbanization speeches into the
enviro-labor speeches dataset.
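a minimal sketch of the subsetting step, to make the trade-off concrete. the keyword lists below are hypothetical placeholders (not the actual v1 or v2 sets), and `compile_keywords` / `classify` are illustrative names, not the existing code:

```python
import re

# hypothetical keyword lists, for illustration only (NOT the real v1/v2 sets)
ENV_KEYWORDS = ["environment", "pollution", "conservation"]
LABOR_KEYWORDS = ["labor", "union", "wage"]

def compile_keywords(keywords):
    """build one case-insensitive regex matching any keyword as a word prefix."""
    # \b on the left keeps "union" from matching inside "reunion";
    # \w* on the right lets "environment" also catch "environmental".
    # the same \w* means "labor" also catches "laboratory" (the wide-net problem)
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\w*"
    return re.compile(pattern, flags=re.IGNORECASE)

ENV_RE = compile_keywords(ENV_KEYWORDS)
LABOR_RE = compile_keywords(LABOR_KEYWORDS)

def classify(speech):
    """return "enviro", "labor", "enviro-labor", or None for one speech text."""
    is_env = bool(ENV_RE.search(speech))
    is_labor = bool(LABOR_RE.search(speech))
    if is_env and is_labor:
        return "enviro-labor"
    if is_env:
        return "enviro"
    if is_labor:
        return "labor"
    return None
```

note that nothing like “globalization” or “urbanization” appears in the lists: a speech on those topics only enters the dataset if it also uses an env’t or labor term, which is the behavior described above.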
environmental/labor keywords v1: specific
environmental v1: specific
labor v1: specific
environmental/labor keywords v2: broad
environmental v2: broad
labor v2: broad
SPEECHES
environmental/labor speeches: v1
enviro speeches v1 (sample of n=1 per year):
labor speeches v1 (sample of n=1 per year):
enviro-labor speeches v1 (sample of n=1 per year):
environmental/labor speeches: v2
speeches per year v2:
enviro speeches v2 (sample of n=1 per year):
labor speeches v2 (sample of n=1 per year):
enviro-labor speeches v2 (sample of n=1 per year):
TOKENS
top 25 bigrams:
enviro v1
enviro v2
labor v1
labor v2
enviro-labor v1
enviro-labor v2
top 10 tokens by year:
enviro v1
enviro v2
labor v1
labor v2
enviro-labor v1
enviro-labor v2
top 10 bigrams by year:
enviro v1
enviro v2
labor v1
labor v2
enviro-labor v1
enviro-labor v2
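for reference, a sketch of how the per-year bigram counts could be computed, assuming speeches come as (year, text) pairs. the tokenizer is deliberately crude (lowercase, letters and straight apostrophes only, no stopword removal), and `tokenize` / `top_bigrams_by_year` are hypothetical names:

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """lowercase and keep runs of letters/apostrophes (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def top_bigrams_by_year(speeches, n=10):
    """speeches: iterable of (year, text) pairs.

    returns {year: [((word1, word2), count), ...]} with the n most common bigrams.
    """
    counts = defaultdict(Counter)
    for year, text in speeches:
        tokens = tokenize(text)
        # pair each token with its successor; bigrams never span two speeches
        counts[year].update(zip(tokens, tokens[1:]))
    return {year: c.most_common(n) for year, c in counts.items()}
```

swapping `zip(tokens, tokens[1:])` for `tokens` gives the top tokens by year with the same structure.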
TF-IPF
(run using v2 keywords, “environmental” and “labor”)
labor tf-ipf
20-yr periods:
10-yr periods:
5-yr periods:
enviro-labor tf-ipf
20-yr periods:
10-yr periods:
5-yr periods:
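a minimal sketch of the tf-ipf calculation at different period widths, assuming it is defined by analogy with tf-idf: periods play the role of documents, so score = (term count in period / total tokens in period) * log(number of periods / number of periods containing the term). the `origin` year of 1960 and the function names are assumptions, and this is not necessarily the exact weighting the existing code uses:

```python
import math
from collections import Counter, defaultdict

def period_of(year, width, origin=1960):
    """map a year to the start year of its period (e.g. 1979 -> 1970 for width=10)."""
    return origin + ((year - origin) // width) * width

def tf_ipf(speeches, width=10, origin=1960):
    """speeches: iterable of (year, tokens) pairs.

    returns {period_start: {term: tf-ipf score}}.
    """
    # term counts per period
    tf = defaultdict(Counter)
    for year, tokens in speeches:
        tf[period_of(year, width, origin)].update(tokens)

    # period frequency: in how many periods does each term appear at all?
    n_periods = len(tf)
    pf = Counter()
    for counts in tf.values():
        pf.update(counts.keys())

    scores = {}
    for period, counts in tf.items():
        total = sum(counts.values())
        scores[period] = {
            term: (count / total) * math.log(n_periods / pf[term])
            for term, count in counts.items()
        }
    return scores
```

under this definition a term that appears in every period scores 0 regardless of how frequent it is, so narrower periods (5-yr vs. 20-yr) change both the number of periods and which terms count as distinctive.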
NEXT STEPS
- CLC presentation thursday (5 mins, 2-4 slides)
- subsetting:
- final enviro/labor keywords?
- figure out encoding problem
- analysis:
- validation:
- are there meaningful differences between daily vs. bound
speeches?
- validating scraped 2016-2024 data cleaning/processing against
stanford 1873-2016 data cleaning/processing